Random Forest variable importance with missing data
Authors
Abstract
Random Forests are commonly applied for data prediction and interpretation. The latter purpose is supported by variable importance measures that rate the relevance of predictors. Yet existing measures cannot be computed when the data contain missing values. Possible solutions are imputation methods, complete case analysis and a newly suggested importance measure. However, it is unknown to what extent these approaches provide a reliable estimate of a variable's relevance. An extensive simulation study was performed to investigate this property for a variety of missing-data-generating processes. Findings and recommendations: Complete case analysis should not be applied, as it inappropriately penalized variables that were completely observed. The new importance measure is much better able to reflect decreased information exclusively for variables with missing values and should therefore be used to evaluate actual data situations. By contrast, multiple imputation allows for an estimation of the importances one would potentially observe in complete data situations.
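The remark about complete case analysis can be made concrete with a small sketch. It does not reproduce the simulation study or any importance measure. It only illustrates, on hypothetical toy data, the mechanism by which a completely observed variable loses observations once rows with any missing value are deleted. The names `make_data` and `complete_cases`, the variables `x1` and `x2`, and the 30% missingness rate are all assumptions made for illustration.

```python
import random


def make_data(n=1000, missing_rate=0.3, seed=0):
    """Hypothetical toy data: x1 is fully observed, x2 is missing completely at random."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = rng.gauss(0, 1)
        if rng.random() < missing_rate:
            x2 = None  # None marks a missing entry
        rows.append({"x1": x1, "x2": x2})
    return rows


def complete_cases(rows):
    """Keep only rows that contain no missing value in any variable."""
    return [row for row in rows if all(v is not None for v in row.values())]


rows = make_data()
cc = complete_cases(rows)

# x1 has no missing values, yet it loses every row in which x2 is missing.
n_x1_before = sum(r["x1"] is not None for r in rows)
n_x1_after = len(cc)
print(f"x1 observations: {n_x1_before} before, {n_x1_after} after complete case analysis")
```

Any importance computed on `cc` is therefore based on less information about `x1` than the data actually contain, even though `x1` itself is completely observed.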
Similar resources
Variable selection with Random Forests for missing data
Variable selection has been suggested for Random Forests to improve their efficiency of data prediction and interpretation. However, its basic element, i.e. variable importance measures, cannot be computed in a straightforward way when there are missing data. Therefore an extensive simulation study has been conducted to explore possible solutions, i.e. multiple imputation, complete case analysis and a n...
Using Random Forests and Fuzzy Logic for Automated Storm Type Identification
This paper discusses how random forests, ensembles of weakly-correlated decision trees, can be used in concert with fuzzy logic concepts to both classify storm types based on a number of radar-derived storm characteristics and provide a measure of “confidence” in the resulting classifications. The random forest technique provides measures of variable importance and interactions, as well as meth...
Variable Selection from Random Forests: Application to Gene Expression Data
Random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its use for ge...
Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study
Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be...
Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations
The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered as an important pitfall, n...